43 research outputs found

    Data Vaults: Database Technology for Scientific File Repositories

    Get PDF
    Current data-management systems and analysis tools fail to meet scientists’ data-intensive needs. A "data vault" approach lets researchers effectively and efficiently explore and analyze information

    Lazy ETL in Action: ETL Technology Dates Scientific Data

    Get PDF
    Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific data analysis. However, the initial loading is a time and resource consuming operation. It might not be entirely necessary, e.g. if the user is interested in only a subset of the data. We propose to demonstrate Lazy ETL, a technique to lower costs for initial loading. With it, ETL is integrated into the query processing of the scientific data warehouse. For a query, only the required data items are extracted, transformed, and loaded transparently on-the-fly. The demo is built around concrete implementations of Lazy ETL for seismic data analysis. The seismic data warehouse is ready for query processing, without waiting for long initial loading. The audience fires analytical queries to observe the internal mechanisms and modifications that realize each of the steps; lazy extraction, transformation, and loading

    Instant-on scientific data warehouses: Lazy ETL for data-intensive research

    Get PDF
    In the dawning era of data intensive research, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. Similar to classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both, time and resource intensive and may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the costs for data loading: Lazy ETL. Data is extracted and loaded transparently on-the-fly only for the required data items. Extensive experiments demonstrate the significant reduction of the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the costs for bootstrapping a scientific data warehouse, our approach also reduces the costs for loading new incoming data

    Data Vaults: a Database Welcome to Scientific File Repositories

    Get PDF
    Efficient management and exploration of high-volume scientific file repositories have become pivotal for advancement in science. We propose to demonstrate the Data Vault, an extension of the database system architecture that transparently opens scientific file repositories for efficient in-database processing and exploration. The Data Vault facilitates science data analysis using high-level declarative languages, such as the traditional SQL and the novel array-oriented SciQL. Data of interest are loaded from the attached repository in a just-in-time manner without need for up-front data ingestion. The demo is built around concrete implementations of the Data Vault for two scientific use cases: seismic time series and Earth observation images. The seismic Data Vault uses the queries submitted by the audience to illustrate the internals of Data Vault functioning by revealing the mechanisms of dynamic query plan generation and on-demand external data ingestion. The image Data Vault shows an application view from the perspective of data mining researchers

    Instant-on scientific data warehouses lazy ETL for data-intensive research

    No full text
    In the dawn of the data intensive research era, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. Similar to classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both, time and resource intensive and may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the costs for data loading: Lazy ETL. Data is extracted and loaded transparently on-the-fly only for the required data items. Extensive experiments demonstrate the significant reduction of the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the costs for bootstrapping a scientific data warehouse, our approach also reduces the costs for loading new incoming data
    corecore